Semantic Regularities in Document Representations
نویسندگان
چکیده
Recent work exhibited that distributed word representations are good at capturing linguistic regularities in language. This allows vector-oriented reasoning based on simple linear algebra between words. Since many different methods have been proposed for learning document representations, it is natural to ask whether there is also linear structure in these learned representations to allow similar reasoning at document level. To answer this question, we design a new document analogy task for testing the semantic regularities in document representations, and conduct empirical evaluations over several state-of-theart document representation models. The results reveal that neural embedding based document representations work better on this analogy task than conventional methods, and we provide some preliminary explanations over these observations.
منابع مشابه
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...
متن کاملMixing Dirichlet Topic Models and Word Embeddings to Make lda2vec
Distributed dense word vectors have been shown to be effective at capturing tokenlevel semantic and syntactic regularities in language, while topic models can form interpretable representations over documents. In this work, we describe lda2vec, a model that learns dense word vectors jointly with Dirichlet-distributed latent document-level mixtures of topic vectors. In contrast to continuous den...
متن کاملLearning semantic structures from in-domain documents
Semantic analysis is a core area of natural language understanding that has typically focused on predicting domain-independent representations. However, such representations are unable to fully realize the rich diversity of technical content prevalent in a variety of specialized domains. Taking the standard supervised approach to domainspecific semantic analysis requires expensive annotation ef...
متن کاملDeep Learning of Semantic Word Representations to Implement a Content-Based Recommender for the RecSys Challenge'14
In this paper, we will discuss a recommender system that exploits the semantics regularities captured by a Recurrent Neural Network (RNN) in text documents. Many information retrieval systems treat words as binary vectors under the classic bag-of-words model; however there is not a notion of semantic similarity between words when describing a document in the resulting feature space. Recent adva...
متن کاملLearning Semantic Structures from In - domain
Semantic analysis is a core area of natural language understanding that has typically focused on predicting domain-independent representations. However, such representations are unable to fully realize the rich diversity of technical content prevalent in a variety of specialized domains. Taking the standard supervised approach to domainspecific semantic analysis requires expensive annotation ef...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1603.07603 شماره
صفحات -
تاریخ انتشار 2016